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Abstract 

Nonnegative Matrix Factorization (NMF) is a data analysis technique which allows compres- 
sion and interpretation of nonnegative data. NMF became widely studied after the publication 
of the seminal paper by Lee and Seung (Learning the Parts of Objects by Nonnegative Matrix 
Factorization, Nature, 1999, vol. 401, pp. 788 791), which introduced an algorithm based on Mul- 
tiplicative Updates (MU) . More recently, another class of methods called Hierarchical Alternating 
Least Squares (HALS) was introduced that seems to be much more efficient in practice. 

In this paper, we consider the problem of approximating a not necessarily nonnegative matrix 
with the product of two nonnegative matrices, which we refer to as Nonnegative Factorization (NF) ; 
this is the subproblem that HALS methods implicitly try to solve at each iteration. We prove that 
NF is NP-hard for any fixed factorization rank, using a reduction to the maximum edge biclique 
problem. 

We also generalize the multiplicative updates to NF, which allows us to shed some light on 

the differences between the MU and HALS algorithms for NMF and give an explanation for the 
better performance of HALS. Finally, we link stationary points of NF with feasible solutions of the 
biclique problem to obtain a new type of biclique finding algorithm (based on MU) whose iterations 
have an algorithmic complexity proportional to the number of edges in the graph, and show that 
it performs better than comparable existing methods. 

Keywords: Nonnegative Matrix Factorization, Nonnegative Factorization, Complexity, Multi- 
plicative Updates, Hierarchical Alternating Least Squares, Maximum Edge Biclique. 
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1 Introduction 



(Approximate) Nonnegative Matrix Factorization (NMF) is the problem of approximating a given 
nonnegative matrix by the product of two low-rank nonnegative matrices: given a matrix M > 0, one 
has to compute two low-rank matrices V,W such that 

M^VW. (1.1) 

This problem was first introduced in 1994 by Paatero and Tapper [25|, and more recently received 
a considerable interest after the publication of two papers by Lee and Seung |2T1 [22] . It is now well 
established that NMF is useful in the framework of compression and interpretation of nonnegative 
data ; it has for example been applied in analysis of image databases, text mining, interpretation of 
spectra, computational biology and many other applications (see e.g. [2l[9l[Tl] and references therein). 



How can one interpret the outcome of a NMF? Assume each column M-j of matrix M represents an 



element of a data set: expression (1.1 ) can be equivalently written as 

M.,j^Y.^--kWkj, Vj (1.2) 

k 

where each element M-j is decomposed into a nonnegative linear combination (with weights Wkj) 
of noimegative basis elements ({^fc}, the columns of V). Nonnegativity of V allows interpretation 
of the basis elements in the same way as the original nonnegative elements in M, which is crucial 
in applications where the nonnegativity property is a requirement (e.g. where elements are images 
described by pixel intensities or texts represented by vectors of word counts). Moreover, nonnegativity 
of the weight matrix W corresponds to an essentially additive reconstruction which leads to a part- 
hased representation: basis elements will represent similar parts of the columns of M. Sparsity is 
another important consideration: finding sparse factors improves compression and leads to a better 
part-based representation of the data [19] . 

We start this paper with a brief introduction to the NMF problem: Section [2] recalls existing 
complexity results, introduces two well-known classes of methods: multiplicative updates ^2] and 
hierarchical alternating least squares [6] and proposes a simple modification to guarantee their conver- 
gence. The central problem studied in this paper, Nonnegative Factorization (NF), is a generalization 
of NMF where the matrix to be approximated with the product of two low-rank nonnegative matrices 
is not necessarily nonnegative. NF is introduced in Section |3j where it is shown to be NP-hard for 
any given factorization rank, using a reduction to the problem of finding a maximum edge biclique. 
Stationary points of the NF problem used in that reduction are also studied. This section ends with 
a generalization of the NMF multiplicative updates rules to the NF problem and a proof of their 
convergence. This allows us to shed new light on the standard multiplicative updates for NMF: a new 
interpretation is given in Section [4| which explains the relatively poor performance of these methods 
and hints at possible improvements. Finally, Section [5] introduces a new type of biclique finding algo- 
rithm that relies on the application of multiplicative updates to the equivalent NF problem considered 
earlier. This algorithm only requires a number of operations proportional to the number of edges of 
the graph per iteration, and is shown to perform well when compared to existing methods. 



2 Nonnegative Matrix Factorization (NMF) 

Given a matrix M G M™^" and an integer r G Nq, the NMF optimization problem using the Frobenius 
norm is defined as 

min I |M - VW\ |l = V(M - FVF)?- such that V,W >0 (NMF) 
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j^mxn f;igjiotes the set of real matrices of dimension mxn; M™^" the set of nonnegative matrices i.e. 
j^mxn ^j^}^ every entry nonnegative, and the zero matrix of appropriate dimensions. 
A wide range of algorithms have been proposed to find approximate solutions for this problem (see 
e.g. ISllllEKTllIOllIIlIiaiMI)- Most of them use the fact that although problem ( |NMF[ ) is not convex, its 
objective function is convex separately in each of the two factors V and W (which implies that finding 
the optimal factor V corresponding to a fixed factor W reduces to a convex optimization problem, and 
vice- versa), and try to find good approximate solutions by using alternating minimization schemes. 
For instance, Nonnegative Least Squares (NNLS) algorithms can be used to minimize (exactly) the 
cost function alternatively over factors V and W (see e.g. [5l [20]'). 

Actually, there exist other partitions of the variables that preserve convexity of the alternating 
minimization subproblems: since the cost function can be rewritten as \\M — it is 

clearly convex as long as variables do not include simultaneously an element of a column of V and 
an element of the corresponding row of W (i.e. V^i and Wu for the same index i). Therefore, given 
a subset of indexes K Q R = {1,2, 
variables 

PK = {v.\ieK^ u {Wj: \ jeR\K^ 

and its complement 



, r}, (NMF) is clearly convex for both the following subsets of 



QK = {Vi\ieR\K} u [Wj..\jeK} 



However, the convexity is lost as soon as one column oi V (V-i) and the corresponding row of W 
(Wi;) are optimized simultaneously, so that the corresponding minimization subproblem can no longer 
be efficiently solved up to global optimality. 



2.1 Complexity 

Vavasis studies in [30] the algorithmic complexity of the NMF optimization problem; more specifically, 
he proves that the following problem, called Exact Nonnegative Matrix Factorizatior^ is NP-hard: 

(Exact NMF) Given a nonnegative matrix M > of rank k, find, if possible, two nonneg- 
ative factors V > and > of rank k such that M = VW. 



The NMF optimization problem is therefore also NP-hard, since when the rank r is equal to the rank 
k of the matrix M, any optimal solution to the NMF optimization problem can be used to answer the 
Exact NMF problem (the answer being positive if and only if the optimal objective value of the NMF 
optimization problem is equal to zero). 

The NP-hardness proof for exact NMF relies on its equivalence with a NP-hard problem in poly- 
hedral combinatorics, and requires both the dimensions of matrix M and its rank k to increase to 
obtain NP-hardness. In contrast, in the special cases when rank k is equal to 1 or 2, the exact NMF 
problem can always be answered in the affirmative: 

1. When A; = 1, it is obvious that for any nonnegative rank-one matrix M > there is nonnegative 
factors V > and u; > such that M = vvj^ . 

Moreover, the NMF optimization problem with r = 1 can be solved in polynomial time: the 
Perron-Frobenius theorem implies that the dominant left and right singular vectors of a nonneg- 
ative matrix M are nonnegative, while the Eckart- Young theorem states that the outer product 
of these dominant singular vectors is the best rank-one approximation of M; these vectors can 
be computed in polynomial-time using for example the singular value decomposition |15j . 

^This is closely related to the nonnegative rank of matrix M, which is the minimum value of r for which there exists 
V e R^'"' and W G W^'' such that M = VW (see [I]). 
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2. When nonnegative matrix M has rank 2, Thomas has shown [29] that exact NMF is also always 
possible (see also [8]). The fact that any rank- two nonnegative matrix can be exactly factorized 
as the product of two rank-two nonnegative matrices can be explained geometrically as follows: 
viewing columns of M as points in M™, the fact that M has rank 2 implies that the set of 
its columns belongs to a two-dimensional subspace. Furthermore, because these columns are 
nonnegative, they belong to a two-dimensional pointed cone, see Figure [TJ Since such a cone is 
always spanned by two extremes vectors, this implies that all columns of M can be represented 
exactly as nonnegative linear combinations of two nonnegative vectors, and therefore the exact 
NMF is always possibl^ 








Figure 1: Rank-two exact NMF (A; = 2): m = 3 and n = 10. 



Moreover, these two extreme columns can easily be computed in polynomial time (using for 
example the fact that they define an angle of maximum amplitude among all pairs of columns). 
Hence, when the optimal rank-two approximation of matrix M is nonnegative, the NMF opti- 
mization problem with r = 2 can be solved in polynomial time. However, this optimal rank-two 
approximation is not always nonnegative, so that the complexity of the NMF optimization in 
the case r = 2 is not known. Furthermore, to the best of our knowledge, the complexity of the 
exact NMF problem and the NMF optimization problem are still unknown for any fixed rank r 
or k greater than 3. 



2.2 Multiplicative Updates (MU) 



In their seminal paper [22] , Lee and Seung propose multiplicative update rules that aim at minimizing 
the Frobenius norm between M and VW. To understand the origin of these rules, consider the 
Karush-Kuhn- Tucker first-order optimality conditions for (NMF) 

V>0, W>0 (2.1) 

Vv\\M -VW\\%>0, Vw\\M -VW\\l>0 (2.2) 
V oVv\\M -VW\\l = 0, W oVw\\M -VW\\l = (2.3) 

where o is the Hadamard (component-wise) product between two matrices, and 

Vv\\M -VW\\l = -2{M -VW)W^, Vw\\M -VW\\l = -2V^{M -VW) . (2.4) 



■^The reason why this property no longer holds for higher values of the rank k is that a fc-dimensional cone is not 
necessarily spanned by a set of k vectors when k > 2. 
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Injecting (2.4) in (2.3), we obtain 

Vo{VWW^) = Vo{MW^) (2.5) 

W o {V^VW) = Wo {V^M) . (2.6) 

From these equalities, Lee and Seung derive the following simple multiplicative update rules (where 
|j is Hadamard (component-wise) division) 

V^Vo [^^"] W^Wo (2.7) 

[vww^y [v^vw] ^ ' 

for which they are able to prove a monotonicity property: 

Theorem 1 (||22j). The Frohenius norm \ \M — VW\\f is nonincreasing under the multiplicative update 



rules (2.7). 



The algorithm based on the alternated application of these rules is not guaranteed to converge to a 
first-order stationary point (see e.g. [2j, and references therein), but a slight modification proposed 
in [23] achieves this property (roughly speaking, MU is recast as a variable metric steepest descent 
method and the step length is modified accordingly). We propose another possibility to overcome this 
problem by replacing the above updates by the following: 

Theorem 2. For every constant e > 0, \\M — VW\\f is nonincreasing under 

for any (V, W) > e. Moreover, every limit point of this algorithm is a stationary point of the following 
optimization problem 

min \\M-VW\\l. (2.9) 

V>t,W>t 



Proof. See Section 3.3 of this paper, where the more general Theorem [9] is proved. □ 



2.3 Hierarchical Alternating Least Squares (HALS) 

Cichoki et al. [6^ and independently several other authors |14| [T8] have proposed to solve the problem 
of Nonnegative Matrix Factorization by considering successively each rank-one factor V-kWk- while 
keeping the rest of the variables fixed, which can be expressed as 

M ^V,kWk:+Y,^--i^i-- ^ V:kWk: ^ M -Y^y-.iWi: OI V.kWk: ^ Rk (2.10) 

where matrix R^ is called the fc*^ residual matrix. 

Ideally, one would like to find an optimal rank-one factor V-kWk-. according to the Frobenius norm, 
i.e. solve the following problem 

min \\M - VW\\l = \\Rk - VkWk-\\l such that K^, I7fc. > (2.11) 

but, instead of solving this problem directly, these authors propose to optimize column V:k and row Wk-, 
separately in an alternating scheme, because the optimal solution to these two (convex) subproblems 
can be easily computed in closed form, see e.g. |16j : 

V.;^ = e.i^gmmy,^y,\\Rk-VkWk:\\l = max (^0, (2.12) 
Wl^ = aigmmw^,^o\\Rk-V.,kWk:\\l = max^O,^^^^ (2.13) 



5 



This scheme, which amounts to a block coordinate descent method (for which any cychc order on 
the columns of V and the rows of W is admissible), is called Hierarchical Alternating Least Squares 
(HALS)[^and it has been observed to work remarkably well in practice: it outperforms, in most cases, 
the other algorithms for NMF [71 [TSl [T6] . Indeed, it combines a low computational cost per iteration 
(the same as the multiplicative updates) with a relatively fast convergence (significantly faster than 
the multiplicative updates), see Figure [s] for an example. We will explain later in Section |4] why this 
algorithm performs much better than the one of Lee and Seung. 

A potential issue with this method is that, in the course of the optimization process, one of the 
vectors V-k (or Wk-.) and the corresponding rank-one factor V±Wk: may become equal to zero (this 
happens for example if one of the residuals Rk is nonpositive) . This then leads to numerical instabilities 
(the next update is not well-defined) and a rank-deficient approximation (with a rank lower than r). A 



possible way to overcome this problem is to replace the zero lower bounds on V-t and Wk- in (2.12 ) and 



(2.13 ) by a small positive constant, say e <^ 1 (as for the MU), and consider the following subproblems 
V*f, = argmin^^>J|i?fc - V.,kWk:\\l and W^, = aigmm^^ y^WRk - V.,kWk:\\l, (2.14) 
which lead to the modified closed-form update rules: 



This idea was already suggested in [6j in order to avoid numerical instabilities. In fact, this variant of 



the algorithm is now well-defined in all situations because (2.14) guarantees V-k > and Wk- > at 



each iteration. Furthermore, one can now easily prove that it converges to a stationary point. 
Theorem 3. For every constant e > 0, the limit points of the block coordinate descent algorithm 



initialized with positive matrices and applied to the optimization problem (2.9) are stationary points. 



Proof. We use the following result of Powell p7] (see also [3l p.268]): the limit points of the iterates of 
a block coordinate descent algorithm are stationary points provided that the following two conditions 
hold: 

• each block of variables is required to belong to a closed convex set, 

• the minimum computed at each iteration for a given block of variables is uniquely attained. 
The first condition is clearly satisfied here, since V-k and Wk-. belong respectively to ([e, -|-oo[)™' and 



([e, -|-oo[)", which are closed convex sets. The second condition holds because subproblems (2.14) 



can be shown to be strictly convex, so that their optimal value is uniquely attained by the solutions 



provided by rules (2.15). Strict convexity is due to the fact that the objective function of these 
problems are sums of quadratic terms, each involving a single variable and having a strictly positive 
coefficient. □ 



3 Nonnegative Factorization (NF) 



Looking back at subproblem (2.11), i.e. approximating the residual Rk with a rank-one term V-k^k-., 



we have seen that the optimal solution separately for both Vk and Wk-. can be written in a closed form. 



In the previous section, subproblem (2.11) was then solved by a block coordinate descent algorithm. 



A question arises: Is it possible to do better? i.e. Is it possible to efficiently solve the problem for 
both vectors simultaneously? In order to answer this question, we introduce the problem of Nonnegative 



^In dnmi], it is called Rank-one Residue Iteration (RRI) method and in [H] Alternating NMF (ANMF). 
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FactorizatioE0 which is exactly the same as Nonnegative Matrix Factorization except that the matrix 
to factorize can be any real matrix, i.e. is not necessarily nonnegative. Given M G ]^™x" and r G Nq, 
the Nonnegative Factorization optimization problem using the Frobenius norm is: 



mm 



\M -VWWj, 
V>0, W>0 



(NF) 



Of course, this problem is a generalization of (NMF) and is NP-hard as well. However Nonnegative 
Factorization will be shown below to be NP-hard for any fixed factorization rank (even r = 1), which 



is not the case of (NMF) (cf. Section 2.1 ). The proof is based on the reduction to the maximum edge 
biclique problem. 



3 . 1 C omplexity 



The main result of this section is the NP-hardness result for Nonnegative Factorization for any fixed 
factorization rank. We first show how the optimization version of the maximum edge biclique problem 
(MBP) can be formulated as a rank-one Nonnegative Factorization problem (NF-ld). Since the 
decision version of (MBP) is NP-complete [26], this implies that (NF-ld) is NP-hard. We then prove 



that (NF) is NP-hard as well using a simple construction. 



The Maximum Edge Biclique Problem in Bipartite Graphs 

A bipartite graph Gb is a graph whose vertices can be divided into two disjoint sets Vi and V2 such 
that there is no edge between two vertices in the same set 

Gb = {V, E) = (Vi UV2,EC {Vi X V2)) . 

A biclique Kb is a complete bipartite graph i.e. a bipartite graph where all the vertices are connected 

Kb = {V, E') = {vi U Vi, E' = {V; X ^2')) • 

Finally, the so-called maximum edge biclique problem in a bipartite graph Gb = (V, E) is the prob- 
lem of finding a biclique Kb = {V ,E') in Gb (i.e. V' V and E' C E) maximizing the number of 
edges. The decision problem: Given B, does Gb contain a biclique with at least B edges? has been 
shown to be NP-complete |26j. Therefore the corresponding optimization problem is at least NP-hard. 

Let Mb £ {0, l}™><" be the adjacency matrix of the unweighted bipartite graph Gb = (Vi U V2, E) 
i.e. Mb{i,j) = 1 if and only if (Vi(i), V2(j)) G E. In order to avoid trivialities, we will suppose that 
each vertex of the graph is connected to at least one other vertex i.e. Mb{i, :) 7^ 0, Mb{:,j) 7^ 0, Vi, j. 
We denote by \E\ the cardinality of E i.e. the number of edges in Gb] note that \E\ = \ \Mb\\p. The set 
of zero values will be called Z = {{i,j) \ Mb{i,j) = 0}, and its cardinality \Z\ satisfies \E\ + \Z\ = mn. 

With this notation, the maximum biclique problem in Gb can be formulated as 

min \\Mb — vwWj^ 

G {0,l}™,u; G {0,1}" (MBP) 
ViWj < Mb{i,j), 

*This terminology has already been used for the problem of finding a symmetric nonnegative factorization, i.e. one 
where V=W, but we assign it a different meaning in this paper. 
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In fact, one can check easily that this objective is equivalent to maxy^w '^i'Wj since Mfe, v and w 
are binary: instead of maximizing the number of edges inside the biclique, one minimizes the number 
of edges outside. 



Feasible solutions of (MBP) correspond to bicliques of Gh- We will be particularly interested in 



maximal bicliques. A maximal biclique is a biclique which is not contained in a larger biclique: it is 



a locally optimal solution of ( MBP ) . 



The corresponding rank-one Nonnegative Factorization problem is defined as 

\\Md 



mm 



vw\\\ 



v > 0, w > 



with the matrix M,i defined as 



Md = (1 + d)Mi, -dlr 



d>0 



(NF-ld) 



(3.1) 



where Imxn is the matrix of all ones with dimension m x n. is the matrix Mf, where the zero 
values have been replaced by —d. Clearly is not necessarily a nonnegative matrix. 



To prove NP-hardness of (NF-ld), we are going to show that, if d is sufficiently large, optimal solutions 



of (NF-ld) coincide with optimal solutions of the corresponding biclique problem (MBP). From now 



v'w' 



on, we say that a solution {v,w) coincides with another solution {v',w') if and only if vw 
(i.e. if and only if v' = Xv and w' = X^^w for some A > 0). We also let M_|_ = max(0,M) and 
M- = max(0,-M). 

Lemma 1. Any optimal rank-one approximation with respect to the Frobenius norm of a matrix M 
for which min(M) < — ||M+||i7 contains at least one nonpositive entry. 

Proof. If M = 0, the result is trivial. If not, min(Af) < since min(M) < — HM+Hj?. Suppose now 
{v, w) > is a best rank-one approximation of M. Therefore, since the negative values of M are 
approximated by positive ones and since M has at least one negative entry, we have 



||M - vw\\p > ||M_|||,. 

By the Eckart- Young theorem, 

||M - VW\\1 = \\M\\l - GmaxiMf = \\M\\l 

where CTmaxiM) is the maximum singular value of M. Clearly, 

^M+\\p + \\M^\\p and 

So 



(3.2) 



\M\\l 



||M|||, = IIM+III, ||M„|||, and ||M||^ > min(M)^ 
||M - vw\\p < ||M+|||, + ||M_|||. - min(M)^ < ||M_||ir 



which is in contradiction with (3.2). 



□ 



We restate here a well-known result concerning low-rank approximations (see e.g. |16l p. 29]). 

Lemma 2. The local minima of the best rank-one approximation problem with respect to the Frobenius 
norm are global minima. 



We can now state the main result about the equivalence of (NF-ld) and (MBP) 
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Theorem 4. Ford > y any optimal solution (v,w) of (NF-ld) coincides with an optimal solution 
of (MBP), i.e. vw is binary and vw < M^. 



Proof. We focus on the entries of vw which are positive and define 



-RT = |z e {1,2, . . . ,m} > o| and L = | j € {1, 2, . . . , n} > o|. 



(3.3) 



v' = v{K), w' = w{L) and = M(i{K,L) are the submatrices with indexes in (K,L). Since {v,w) 
is optimal for M^, {v',w') must be optimal for M^. Suppose there is a —d entry in M^, then 

min(M^) = -d< = "11 W + < -||(M^) + ||f, 

so that Lemma [T] holds for M'^. Since {v',w') is positive and is an optimal solution of (NF-ld) 
for M^, (v',w') is a local minimum of the unconstrained problem i.e. the problem of best rank- 
one approximation. By Lemma [2| this must be a global minimum. This is a contradiction with 
Lemma llj {v',w') should contain at least one nonpositive entry. Therefore, = 1|_r-|x|l| which 
implies v w' = M'^ by optimality and then vw is binary and vw < M^. □ 



Corollary 1. Rank-one Nonnegative Factorization is NP-hard. 



Intuitively, the reason (NF-ld) is NP-hard is that if one of the —d entries of is approximated 
by a positive value, say p, the corresponding error is d'^ + 2pd +p'^. Therefore, the larger d, the more 
expensive it is to approximate —d by a positive number. Because of that, when d increases, negatives 
values of will be approximated by smaller values and eventually by zeros. 

Therefore, for each negative entry of M, one has to decide whether to approximate it with zero 
or with a positive value. Moreover, when a value is approximated by zero, one has to choose which 
entries of V and W will be equal to zero, as in the biclique problem. We suspect that the hardness of 
Nonnegative Factorization lies in these combinatorial choices. 

We can now answer our initial question: Would it be possible to solve efficiently the problem 

VkWk: ^Rk = M-Y, y-.iWi: t 

simultaneously for both vectors iV-k^Wk-)! Obviously, unless P=NP, we won't be able to find a 
polynomial-time algorithm to solve this problem. Therefore, it seems hopeless to improve the HALS 
algorithm using this approach. 



Remark 1. Corollary^ suggests that NMF is a difficult problem for any fixed r > 2. Indeed, even 
if one was given the optimal solution of an NMF problem except for one rank-one factor, it is not 
guaranteed that one would be able to find this last factor in polynomial-time since the corresponding 
residue is not necessarily nonnegative. 

We now generalize Theorem [T] to factorizations of arbitrary rank. 



Theorem 5. Nonnegative Factorization (NF) is NP-hard. 

Proof. Let Mf, € {0, 1}™^" be the adjacency matrix of a bipartite graph Gf, and r > 1 the factorization 
rank of (NF). We define the matrix Af^ as 



A = diag(Mfe,r) 



/Mb 
Mh 

\ ... 



\ 


Mb J 
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which is the adjacency matrix of another bipartite graph which is nothing but the graph Gb 
repeated r times. A^^ is defined in the same way as M^, i.e. 

Ad = {1+ d)Ah -dlmxn 



with d > ^yr\E\. Let {V, W) be the optimal rank-r nonnegative factorization of A^^ and consider each 
rank-one factor VkWk-. ^ Rk = A^ — Yli^k ^.i^i-'- each of them must clearly be an optimal rank-one 
nonnegative factorization of R^. Since Rk < A^, 

mm{Rk) < mm{Ad) = -d < -\\{Ad)+\\F < -\\{Rk) + \\F, 

and Lemma [T] holds. Using exactly the same arguments as in Theorem |4j one can show that, V/c, 

{VkWk;)ij = 0, s.t. Ad{iJ) = -d. 

Therefore, the positive entries of each rank-one factor will correspond to a bi clique of G^. By optimality 
of (y, VF), each rank-one factor must correspond to a maximum biclique of Gi, since is the graph 
Gb repeated r times. Thus (INF-Idl) is NP-hard imphes dNFt) is NP-hard. □ 



3.2 Stationary points of (NF-ld) 



We have shown that optimal solutions of (NF-ld) coincide with optimal solutions of (MBP) for 



d> VE, which are NP-hard to find. In this section, we focus on stationary points of (NF-ld) instead: 
we show how they are related to the feasible solutions of ( MBP ) . This result will be used in Section 
|5]to design a new type of biclique finding algorithm. 

3.2.1 Definitions and Notation 



The KKT conditions of (NF-ld), which define the stationary points, are exactly the same as for 
(NMF): {v, w) is a stationary point of (NF-ld) if and only if 



V >0, jJL = {vw — Md)w^ 
w>0, A = v^{vw - Md) 



> and t; o ^ = 

> and w o X = 0. 



(3.4) 
(3.5) 



Of course, we are especially interested in nontrivial solutions and we then assume VjW ^ so that 
one can check that (3.4)-(3.5) are equivalent to 



max I 



\w\ 



(3.6) 



Given d, we define three sets of rank-one matrices: S^, corresponding to the nontrivial stationary 
points of dNF-ldL with 



Sd = {vw G 



{v, w) satisfy (3.6)}, 



F, corresponding to the feasible solutions of (MBP), with 

F = {vwe M™' 



{v,w) is a feasible of (MBP)}, 



and B, corresponding to the maximal bicliques of (MBP), i.e. vw G -B if and only if vw E F and vw 
coincides with a maximal biclique. 
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3.2.2 Stationarity of Maximal Bicliques 



The next theorem states that, for d sufficiently large, the only nontrivial feasible solutions of (MBP) 



that are stationary points of (NF-ld) are the maximal bicliques 



Theorem 6. For d > max(rn,, n) — 1, F D Sd = B. 
Proof, vw G i? if and only if vw £ F and is maximal i.e. 

(1) $i such that Vi = and Md{i,j) = l,Vj s.t. Wj 7^ 0, 

(2) $j such that wj = and Md{i,j) = l,Vz s.t. Vi ^ 0. 

Since vw is binary and w 7^ 0, the nonzero entries of v and w must be equal to each other. Moreover, 
d > max(m, n) — 1 so that (1) is equivalent to 

$ i such that Vi = and Md{i, :)w^ > 
Md{^,■.)w^ < and ^ = = 







This is equivalent to the stationarity conditions for v ^ 0, cf. (3.6). By symmetry, (2) is equivalent 
to the stationarity conditions for w. □ 

Theorem |6] implies that, for d sufficiently large, B C Sd- It would be interesting to have the opposite 
affirmation: for d sufficiently large, any stationary point of (NF-ld) corresponds to a maximal biclique 



of (MBP). Unfortunately, we will see later that this property does not hold. 



3.2.3 Limit points of Sd 

However, as d goes to infinity, we are going to show that the points in Sd get closer to feasible solutions 



of (MBP). 



Lemma 3. The set Sd is bounded i.e. \/d > 0, yvw G Sd: 

\\vw\\2 = \\v\\2\\w\\2 < vl-E'l. 



Proof. For vw G Sd, by (3.6) 



in ^dW^ 
max ( 0, -|-| — 



\w\ 



^ ||max(0,Mrf)w'^||2 ^ || max(0,Mrf)||F ^ V\E\ 



\w\ 



\W\\2 



\W\\2 



Lemma 4. For vw G Sd, if Md{i,j) = —d and if {vw)ij > 0, then 



Proof By (3.6), 



\\v\\i W'^Wi 
< Vi < - — - and < Wj < - — -, 
d+1 ^ d+1 



v\\i 



0<Wj\\v\\i = v' Md{:,j) <\\v\\i-{d+l)vi 0<Vi<^^ 
The same can be shown for w by symmetry. 



□ 



□ 
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Theorem 7. As d goes to infinity, stationary points of (NF-ld) get closer to feasible solutions of 

(3.7) 



dMBPl) i.e. Ve > 0, 3D s.t. W > D: 



max mm \\vw — VhWh\\F < £■ 



Proof. Let vw G 5^ and suppose > 0. W.l.o.g. 111^112 = 1; in fact, iivw £ Sd, (^Xvju?j £ Sd,y\ > 0. 
Note that Lemma [s] implies ||f||2 < Y^|i?|. By (3.6), 



V = MdW and w = ,, . 

11^^112 

Therefore, (f /| I2, if^) > is a pair of singular vectors of Md associated with the singular value 
\\v\\2 > 0. If Md = Imxn, the only pair of positive singular vectors of Md is ( -j=lm, ) so that 

vw = Mf, coincides with a feasible solution of ( MBP ) . 
Otherwise, we define 



A = {i\ MdiiJ) = l,Vj} and B = [j \ Md{i,j) = l,Vi} 



and their complements A = {1, 2, . . . , m}\A, B = {1, 2, . . . , n}\B; hence. 



Md{A, :) 



^\A\xn 



and Md(:,5) = l^x|B|- 



Using Lemma[4]and the fact that ||x||i < \/n||x||2,Vx E M", we get 

< v{A) < ^^^}^^ 1\A\ and < w{B) < 1 



d + 1 

Therefore, since w < 1„, and v < \/\E\ !„, we obtain 

1 



d+1 



\B\- 



(3.9) 



\v{A)w -0\\f < 



d+1 



m^ 



^/n\E\ ) and \\vw{B)-0\\f< 



d+1 



n 



\J •m\E\ 



It remains to show that v{^A)w{B^ coincide with a biclique of the (complete) graph generated by 
Mf,(j4,i?) = l|^|x|_B|- We distinguish three cases: 

(1) A = %. ( [3^ implies ||f||2 < -^{m^J m\E\^ so that 



w{B) 



'-mx\B\ 

\v\ 



l^lli 



12 ll^ll2 
which is absurd if d > my^\E\ since 1110112 = 1- 



12 "-m 



m 



\V\\2 



> 



d + 1 
m\/\E 



L|i?| 



(3.10) 



(2) 5 = 0. Using ( [3^ , we have v{A) = Md{A, ■.)vF = \ \w\\i1\a\ < and then 



\v{A)w{B) - 0\\f < 



d + 1 



(3) ^, S / 0. Noting ku 
1 - \B\ 



Pill 



^, Equation (3.10) gives w{B) = kw 1|_b|- Therefore, 



n 



'd+1 



< \\w\ 



\w{B) 



\w{B)\\l = \B\kl<\\w\\l 



Moreover, v{^A) = l|^|xm'"^'^ = ll^l|il|A| so that 

|5|A;«; < v{A) = (||u;(5)||i + ||w;(S)||i)l|^| < \B\k^ + \B\ 



n 



d + 1 



(3.11) 



(3.12) 
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Finally, combining (3.11) and (3.12) and noting that k^, < 1 since ||tL'||2 = 1, 



1 



d + l 



l|A|x|B| < v{A)w{B) < 1 + 



d + l 



L|A|x|B|- 



We can conclude that, for d sufficiently large, vw is arbitrarily close to a feasible solution of (MBP) 
which corresponds to the biclique {A, B). 



Recall we supposed vw > 0. If vw 0, let {K,L) be the indexes defined in (3.3). The above result 
holds for v{K)w{L) > with the matrix Md{K,L). For d sufficiently large, v{K)w{L) is then close 
to a feasible solution of (MBP) for Md{K,L). Adding zero to this feasible solution gives a feasible 
solution for M^. □ 



Example 1. Let 



-d 1 
1 1 



Clearly, 



belongs to the set B, i.e. it corresponds to maximal biclique of the graph generated 



by M},. By Theorem^ for d > 1, it belongs to Sd i.e. [(0 1)^, (0 1)] is stationary points of (NF-ld). 
For d > 1, one can also check that the singular values of are disjoint and that the second pair 
of singular vectors is positive. Since it is a positive stationary point of the unconstrained problem, 
it is also a stationary point of ( NF-ld[ ). As d goes to infinity, it must get closer to a biclique of 
dMBPt (Th eorem^. Moreover is symmetric so that the right and left singular vectors are equal 
to each other. Figure\^ shows the evolutior^ of this positive singular vector of Md with respect to d. It 

converges to (0 1) and then the product of the left and right singular vector converges to 



1 



G F. 



1 














d=100 ^ 




0.95 


d = 


= 10 \ 








0.9 




d = 1 \ 


0.85 




d = 



0.1 0.2 0.3 0.4 0.5 



Figure 2: Evolution of (viV2)"^(wif2) € Sd- 



^By Wedin's theorem (cf. matrix perturbation theory, see e.g. [28]), singular subspaces of Md associated with a 
positive singular value are continuously deformed with respect to d. 
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3.3 Multiplicative Updates for Nonnegative Factorization 



In this section, the MU of Lee and Seung presented in Section 2.2 to find approximate solutions of 
(NMF) are generahzed to (NF). Other than providing a way of computing approximate solutions 
of (NF), this result will also help us to understand why the updates of Lee and Seung are not very 
efficient in practice. 

The Karush-Kuhn- Tucker optimality conditions of the (NF) problem are the same as for (NMFJ) (see 
Section 2.2). Of course, any real matrix M can be written as the difference of two nonnegative 
matrices: M = P — N with P,N >0. This can be used to generalize the algorithm of Lee and Seung. 
In fact, (2.5) and (2.6) become 



V o {VWW^ + NW^) 
W o {V^VW + V^N) 



V o (PW^) 
W o (V^P) 



(3.13) 
(3.14) 



and using the same idea as in Section 2.2 we get the following multiplicative update rules : 

Theorem 8. For V,W > and M = P - N with P,N > 0, the cost function \\M - VW\\f is 
nonincreasing under the following update rules: 



[PW 



[V^P] 



[VWW^ + NW^] ' 



W ^W' 



(3.15) 



Proof. We only treat the proof for V since the problem is perfectly symmetric. The cost function can 
be split into m independent components related to each row of the error matrix, each depending on a 
specific row of P, N and V, which we call respectively p, n and v. Hence, we can treat each row of V 
separately, and we only have to show that the function 



F{v) 



1, 



\p - n - vW\\p. 



is nonincreasing under the following update 



■uo ^ ""0 o 



[vqWW^ + nW^] 



> 0. 



(3.16) 



F is a quadratic function so that 



F{v) = F{vo) + {v- vo)VF{vo) + -{v - vo)V^F{vq){v - vq)' 



with VF{vq) = {—p + n + vqW)W^ and V'^F{vq) = WW^ . Let G be a quadratic model of F around 
vq: 

G{v) = F{vo) + {v- vo)VF{vo) + -{v - vo)K{vo){v - vof 



with K{vo) = diag 



[voWW'^+nW'^] 



. G has the following nice properties (see below): 



[1) G is an upper approximation of F i.e. G{v) > F{v),yv; 



(2) The global minimum of G{v) is nonnegative and given by (3.16). 



Therefore, the global minimum of G, given by (3.16), provides a new iterate which guarantee the 
monotonicity of F. In fact. 



F{vo) = G{vo) > mmG{v) = G{v*) > F{v*). 
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It remains to show that (1) and (2) hold. 

(1) G{v) > F{v) \/v. This is equivalent to K{vq) — WW'^ positive semidefinite (PSD). Lee and Seung 
have proved 122| that A = diag^^^^^^^J^) - WW^ is PSD (see also [TT). Since B = diag^^^^gp) is 

a diagonal nonnegative matrix for vq > and nW"^ > 0, A + B = K{vq) — WW^ is also PSD. 

(2) The global minimum of G is given by (3.16): 

V* = argmin^G(i;) = vq — {vq)V F {vq) 



VQ - VQO 



Vo o 



[voWW^ + nW^] 



□ 

As with standard multiplicative updates, convergence can be guaranteed with a simple modifica- 
tion: 

Theorem 9. For every constant e > and for M = P — N with P,N > 0, \\M — VW\\f is 
nonincreasing under 

V ^ max 1/ o ,„^,,^T7U^] )■ --""('■ ^ ° pztXTUa^] ) ^'-"^ 

for any {V, W) > e. Moreover, every limit point of this algorithm is a stationary point of the opti- 
mization problem (2.9). 



Proof. We use exactly the same notation as in the proof of Theorem 3.15 so that 

F{vo) = G{vo) > min G{v) = G{v*) > F{v*), vq > e 

v>e 

remains valid. By definition, K{vq) is a diagonal matrix implying that G{v) is the sum ofr independent 
quadratic terms, each depending on a single entry of v. Therefore, 

argmm,>,G(.) = max (e, .o o [,^^^t + ^^T] J ' 
and the monotonicity is proved. 

Let (y, W) be a limit point of a sequence {{V^ , W^)} generated by (3.17). The monotonicity implies 
that {||M — converges to \\M — VW\\f since the cost function is bounded from below. 

Moreover, 



Vifc = max ( e, ajfc ^ifc ) , Vi,A; (3.18) 

where 



T 



PiWi 



Vi-WWl + NiWl' 



which is well-defined since Vi-WW^. > 0. One can easily check that the stationarity conditions of (2.9) 
for V are 

Vik > e. Oik < 1 and {Vik - e) {oik - 1) = 0, Vi, k. 



Finally, by (3.18), we have either Vik = e and Oik < 1, or Vik > e and Oik = l,\/i,k. The same can be 
done for W by symmetry. □ 
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In order to implement the updates (3.15), one has to choose the matrices P and N. It is clear that 
VP, > such that M = P — N , there exists a matrix C > such that the two components P and 

can be written P = + C and = M_ + C. When C goes to infinity, the above updates do 
not change the matrices V and W, which seems to indicate that smaller values of C are preferable. 
Indeed, in the case r = 1, one can prove that C = is an optimal choice: 



Theorem 10. VP, iV > s.t. M = P-N, and Vw G M!^, u; G W^: 

\\M — viw\\p < \ \M — V2w\\f < \\M — vwWf, 

for 



(3.19) 



Vl 



V O 



[vww'^ + M-w'^] 



and V2 = V o 



[Pw^ 



vww^ + Nw' 



Tl • 



Proof. The second inequality of (3.19) is a consequence of Theorem 3.15 For the first one, we treat 
the inequality separately for each entry v i.e. we prove that 

\\Mi: - VuwWf <\\Mi., - V2iW\\F, Vi. 

Let define as the optimal solution of the unconstrained problem i.e. 



V* = argmin^^ 1 1 M, 

and a,b, d > 0, e > 0, as 

a = {M+)i,nF, b = {M_)i.,nF 
Noting that P - M+ = TV - M_, we have 



ViW\\F 



Vli 



d = {P — M^)i-w and e = Viww 
, a + d 

and V2i 



e + b 



b + d 



Suppose V* > Vi. Therefore, 



> 1 



6- e > 



> 



e + b' 



Moreover, 1 < since V2i is a better solution than Vi (Theorem 3.15). Finally, 

a — b 



a + d 
1 < z : < 



e + b + d ~ e + b 



< 



Vi < V2i < Vli < Vi . 



The case vf < Vi is similar. 



□ 



Unfortunately, this result does not hold for r > 1. This is even true for nonnegative matrices, i.e. 
one can improve the effect of a standard Lee and Seung multiplicative update by using a well-chosen 
matrix C. 



Example 2. With the following matrices 



M 
















• 


1 








i) 




1 











w 



1 
1 1 



and C 




we have \\M — V'W\\f < \\M — V"W\\f where V (resp. V") is updated following (3.15) using 
P = M + CandN = C (resp. P = M and N = 0). 

However, in practice, it seems that the choice of a proper matrix C is nontrivial and cannot accelerate 
significantly the speed of convergence. 
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4 How good are the Multiplicative Updates (MU) of Lee and Seung? 



In this section, we use Theorem [s] to interpret the multiphcative rules for ( |NMF ) and show why the 
HALS algorithm performs much better in practice. 

4.1 An Improved Version of the MU 

The aim of the MU is to improve a current solution (V, W) > by optimizing alternatively V {W 
fixed), and vice- versa. In order to prove the monotonicity of the MU, ||M — was shown to be 

nonincreasing under an update of a single row of V (resp. column of W) since the objective function 
can be split into m (resp. n) independent quadratic terms, each depending on the entries of a row of 
V (resp. column of W); cf. proof of Theorem [sj 

However, there is no guarantee, a priori, that the algorithm is also nonincreasing with respect to 
an individual update of a column of V (resp. row of W). In fact, each entry of a column of V (resp. 
row of W) depends on the other entries of the same row (resp. column) in the cost function. The next 
theorem states that this property actually holds. 



Corollary 2. For V,W >0, \\M 



VW\\f is nonincreasing under 



[vwwl] ' 



[vlvw] 



yk, 



(4.1) 



i.e. under the update of any column ofV or any row ofW using the MU (2.7). 



Proof. This is a consequence of Theorem using P = M and = "^^^^ V-iWi-. 
In fact, \\M-VW\\F = \\{M-Y.^^^V,M-)-V±Wk-\\F- 



□ 



Corollary [2] sheds light on a very interesting fact: the multiplicative updates are also trying to 
optimize alternatively the columns of V and the rows of W using a specific cyclic order: first, the 
columns of V and then the rows of W . We can now point out two ways of improving the MU: 

1. When updating a column of V (resp. a row of W), the columns (resp. rows) already updated 
are not taken into account: the algorithm uses their old values; 



2. The multiplicative updates are not optimal: P ^ M+ and N ^ M_ (cf. Theorem 10). Moreover, 



there is actually a closed-form solution for these subproblems (cf. HALS algorithm. Section 2.3). 



Therefore, using Theorem 10, we have the following new improved updates 

ypV^lli;' is nonincreasing under 



Corollary 3. For V,W>0, \\M 



V-u 



[V:kWkWi+ {Rk)-Wi 



[V,iVkWk: + V.i{Rk)-] 



, yk, 



(4.2) 



with Rk = M — ^i^k^-i^i-- Moreover, the updates (4.2) perform better than the updates (4.1), but 
worse than the updates {2.12 2.13) which are optimal. 



Proof. This is a consequence of Theorem [s] and 10 using P = {Rk)+ and N = {Rk) 
In fact, \\M -VW\\F = \\Rk-VkWk:\\F. 



□ 
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4.2 An example 



Figure |3] shows an example of the behavior of the different algorithms: the original MU (Section 2.2 ), 
the improved version (Corollary |3]) and the optimal HALS method (Section 2.3). The test was carried 
out on a commonly used data set for NMF: the cbcl face database]^ 2429 faces (columns) consisting 
each of 19 x 19 pixels (rows) for which we set r = 40 and we used the same scaled (see Remark [s] 
below) random initialization and the same cyclic order (same as the MU i.e. first the columns of V 
then the rows of W) for the three algorithms. We observe that the MU converges significantly less 



x10 



1.4 



0.6 



0.2 




original IVIU 

improved MU 

HALS 



50 



100 
Iterations 



150 



200 



Figure 3: Comparison of the MU of Lee and Seung (2.7), the optimal rank-one NF multiplicative 



updates (4.2) and the HALS ( 2.12j|2.13 ) applied to the cbcl face database. 



rapidly than the two other algorithms. There do not seem to be good reasons to use either the MU 
or the method of Corollary [3] since there is a closed- form solution (2.12 2.13) for the corresponding 
subproblems. 

Finally, the HALS algorithm has the same computational complexity [16] and performs provably 
much better than the popular multiplicative updates of Lee and Seung. Of course, because of the 



NP-hardness of (NMF) and the existence of numerous locally optimal solutions, it is not possible to 



give a theoretical guarantee that HALS will converge to a better solution than the MU: although its 
iterations are locally more efficient, they could still end up at a worse local optimum. 



Remark 2. For r = 1, one can check that the three algorithms above are equivalent for (NMF) 



Moreover, they correspond to the power method ]13^ which converges to the optimal rank-one solution, 
given that it is initialized with a vector which is not perpendicular to the singular vector corresponding 
to the maximum singular value. 



Remark 3. We say that {V, W) is scaled if the optimal solution to the problem 

minWM - aVWWp 



(4.3) 



is equal to 1. Obviously, any stationary point is scaled ; the next Theorem is an extension of a result 
of Ho et al. 118^ . 



CBCL Face Database #1, MIT Center For Biological and Computation Learning. 
Available at http:/ /cbcl.mit.edu/cbcl/ software-datasets/ FaceData2.html. 
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Theorem 11. The following statements are equivalent 

(1) {V,W) is scaled; 

(2) VW is on the boundary of b(^^, ^\\M\\f^, the ball centered at ^ of radius ^\\M\\f; 

(3) \\M - VW\\\ = \\M\\\ - \\VW\\\ (and then \\M\\\ > \\VW\\l). 



Proof. The solution of (4.3) can be written in the fohowing closed form 



(44) 

{VW, VW) ' ^ ' 

where {A, B) = Yliij ^ij^ij = trace{AB^) is the scalar product associated with the Frobenius norm. 
Since a = 1, 

{VW-M,VW) = 

so that (1) and (2) are equivalent. For the equivalence of (1) and (3), we have 
{M-VW,M-VW) = \\M\\l-2{M,VW) + \\VW\\j, 

= \\M\\^p - \\VW\\l - 2(^{M,VW) - {VW,VW)y 
= \\M\\l-\\VW\\j, 

if and only if (M, VW) = {VW, VW) . □ 



Theorem 11 can be used as follows: when you compute the error of the current solution, you can scale 
it without further computational cost. In fact, 

\\M-VW\\l = {M -VW,M -VW) 

= \\M\\l-2{M,VW) + \\VW\\l. (4.5) 

Note that the third term of ( |4.5[ ) can be computed in 0(max(rn-, n)r^) operations since 

ij k ij ky^l 

= E ( E yi) ( E ^4) + 2 E ( E ^-^^0 ( E ^^.^'.) 

k i j ky^l i j 

= ||(y^y)o(W^)||i 

where \ \A\\i = \Aij\. This is especially interesting for sparse matrices since only a small number of 



the entries of VW (which could be dense) need to be computed to evaluate the second term of (4.5). 



5 Biclique Finding Algorithm 

In this section, an algorithm for the maximum edge biclique problem whose main iteration requires 
0(|£^|) operations is presented. It is based on the multiplicative updates for Nonnegative Factorization 
and the strong relation between these two problems (Theorems |4| |6]and[7|. We compare the results 
with other algorithms with iterates requiring Odi^l) operations using the DIMACS database and 
random graphs. 
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5.1 Description 



For d sufficiently large, stationary points of (NF-ld) are close to bicliques of (MBP) (Theorem [7|. 
Moreover, the two problems have the same cost function. One could then think of applying an 



algorithm that finds stationary points of (NF-ld) in order to localize a large biclique of the graph 
generated by Mf,. This is the idea of Algorithm |lj using the multiplicative updates (3.15) with 



P = {Md)+ = Mb and N = (Ma 



Mb). 



A priori, it is not clear what value d should take. Following the spirit of homotopy methods, we chose 
to start the algorithm with a small value of d and then to increase it until the algorithm converges to 



a bichque of (MBP). 



Algorithm 1 Biclique Finding Algorithm in 0(|i?|) operations 



Require: Mb € {0, 1}™X", v G M!p+, w € M!^+, d = do > 0, a > 1. 

1: for /c = 1,2,. . . do 
2: 

[MbW^] 



w 
d 



V o 



w o 



ad 



[v\\w\\l + (i(lm||ty||i - Mbw'^)] 

[v^Mb] 

\v\\lw + d{ln\\v\\i-vTMb)] 



(5.1) 
(5.2) 



3: end for 



We observed that initial value of d should not be chosen too large: otherwise, the algorithm often 
converges to the trivial solution: the empty biclique. In fact, in that case, the denominators in (5.1) 
and (5.2) will be large, even during the initial steps of the algorithm, and the solution is then forced 
to converge to zero. Moreover, since the denominators in (5.1) and (5.2) depend on the graph density, 
the denser the graph is, the greater do can be chosen and vice versa. On the other hand, since our 
algorithm is equivalent to the power method for d = (cf. Remark [2]), if do is chosen too small, it will 
converge to the same solution: the one initialized with the best rank-one approximation of Mb- 

For the stopping criterion, one could, for example, wait until the rounding of vw coincides with a 
feasible solution of (MBP). 



5.2 Other Algorithms in 0(|i?|) operations 

We briefly present here two other algorithms to find maximal bicliques using OdE'l) operations per 
iteration. 

5.2.1 Motzkin-Strauss Formalism 

In [12], the generalized Motzkin-Strauss formalism for cliques is extended to bicliques by defining the 
optimization problem 

max X Mb y 

where = {x £ Rl\ YJl=i a^f = 1}, = {yeWl\ YTi=i Vi = 1} and a, /3 < 1. 
Nonincreasing multiplicative updates for this problem are then provided: 

XO— — , yo > 



x^Mty; ' ' V x^M.y 
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This algorithm does not necessarily converge to a biclique: if a and (3 are not sufficiently small, it 
may converge to a dense bipartite subgraph (a bicluster). In fact, for a = /3 = 2, it converges to an 
optimal rank-one solution of the unconstrained problem as our algorithm does for d = 0. In [12], it is 
suggested to use a and (3 around 1.05. Finally, a / /? will favor one side of the biclique. We will use 
a = p. 

5.2.2 Greedy Heuristic 

The simplest heuristic one can imagine is to add, at each step, a vertex which is connected to most 
vertices in the other side of the bipartite graph. Each time a vertex is selected, the next choices 
are restricted in order to get a biclique eventually: the vertices which are not connected to the one 
you have just chosen are deleted. The procedure is repeated on the remaining graph until you get a 
biclique. One can check that this produces a maximal biclique. 

5.3 Results 

We first present some results for graphs from the DIMACS graph dataselj^ We extracted bicliques in 
those (not bipartite) graphs using the preceding algorithms. We performed 100 runs, 200 iterations 
each, for the two algorithms with the same initializations. We tried to choose appropriate parame- 
ter^ for both algorithms. Table [l] displays the cardinality of the biclique extracted by the different 
algorithms. 

Table [2] shows the results for random graphs: we have generated randomly 100 graphs with 100 vertices 
for different densities (the probability of an edge to belong to the graph is equal to the density) . The 
average numbers of edges in the solutions for the different algorithms are displayed for each density. 
We kept the same configuration as for the DIMACS graphs (same initializations, 100 runs for each 
graph, 200 iterations). It seems that the multiplicative updates generates, in general, better solutions, 
especially when dealing with dense graphs. The algorithm based on the Motzkin-Strauss formalism 
seems less efficient and is more sensitive to the choice of its parameters. 





size 


Greedy 


M. 


-S. 


Mult. 




(m) 




(a = 


1.01) 


{d = l,a = 1.1) 








mean 


best 


mean 


best 


ham62 


64 


304 


157 


225 


269 


320 


ham64 


64 


42 


21 


36 


37 


42 


ham82 


256 


4672 


2839 


3920 


4569 


4770 


ham84 


256 


440 


226 


506 


830 


1015 


john824 


28 


36 


26 


36 


28 


36 


john844 


70 


182 


132 


225 


220 


225 


johnl624 


120 


784 


457 


700 


514 


675 


john3224 


496 


14400 


6294 


9129 


8722 


9108 


MANN a9 


45 


289 


272 


342 


342 


342 


MANN a27 


378 


28728 


15946 


28875 


30800 


30800 



Table 1: Solutions for DIMACS data: number of edges in the bicliques. 



''ftp://dimacs.rutgers.edu/pub/challenge/graph/benchmarks/clique. 

*For the M.-S. algorithm: the alternative choices of parameters a = 1.005 and a = 1.05 for the DIMACS graphs and 
a = 1.005 and a — 1.1 for the random graphs were tested and all gave worse results. Small changes to the parameters 
of the Mult, algorithm led to similar results, so that it seems less sensitive to the choice of its parameters than the M.-S. 
algorithm. 
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density 


Greedy 


M.-S. 


M.-S. 


Mult. 






(a = 


1.01) 


(q = 


1.05) 


{d = l, a = 1.1) 






mean 


best 


mean 


best 


mean 


best 


0.1 


8.9 


10.4 


18.9 


14.4 


19.2 


14.4 


19.2 


0.2 


15.8 


14.5 


29.8 


23.7 


31.5 


23.9 


31.5 


0.3 


24.2 


18.9 


38.7 


33.8 


43.4 


34.1 


43.3 


0.4 


37.7 


24.0 


51.2 


46.9 


61.3 


47.0 


61.0 


0.5 


58.8 


31.7 


69.0 


65.8 


86.7 


67.6 


87.0 


0.6 


92.5 


43.4 


93.7 


90.5 


127.9 


101.7 


127.8 


0.7 


158.4 


63.6 


133.4 


117.9 


190.0 


172.2 


202.4 


0.8 


311.0 


106.0 


207.1 


154.4 


261.5 


328.0 


342.3 


0.9 


806.8 


241.9 


431.3 


88.1 


235.4 


828.1 


828.1 



Table 2: Solutions for random graphs: average number of edges in the bicliques. 



Remark 4. Algorithm [7] enjoys some flexibility: 

• It is applicable to non-binary matrices i.e. weighted graphs. 

• It is possible to favor one side of the biclique. In fact, the multiplicative updates for NF can 



be adapted using the same developments as in Section 3.3 to cost functions with regularization 
terms, e.g. 



min ||M — + Q!||'u||2 + /^ll^i'lb- 

v,w>0 

If d is kept sufficiently small, for example replacing d = ad by d = mm(ad, dm) for some 
dm > 0, there is no guarantee that the algorithm will converge to a biclique. However, the 



negative entries in will enforce the corresponding entries of the solutions of (NF-ld) to 
be small (recall that Theorem^ states that, for d sufficiently large, they will be equal to zero). 
Therefore, by rounding these solutions, instead of a biclique, one gets a dense submatrix of 
i.e. a bicluster. Algorithm^ can then be used as a biclustering algorithm. The density of the 
corresponding submatrix will depend on the choice of dm- Table gives an example of such 
behavior. 



dm 


0.01 


0.05 


0.1 


0.5 


1 


1.5 


size 


5412 


4428 


2952 


1073 


595 


539 


density 


29% 


31% 


35% 


42% 


51% 


52% 



Table 3: Biclusters for the 'classic' text mining dataset (7094 texts and 41681 words with more than 
99.9% of entries equal to zero) with parameters do = 10~^, a = 1.025, maxiter = 500. 



6 Conclusion 

We have introduced Nonnegative Factorization (NF), a new variant of Nonnegative Matrix Factoriza- 
tion (NMF) , and proved its NP-hardness for any fixed rank by reduction to the maximum edge biclique 
problem. The multiplicative updates for NMF can be generalized to NF and provide a new interpre- 
tation of the algorithm of Lee and Seung, which explains why it does not perform well in practice. 
We also developed an heuristic algorithm for the biclique problem whose iterations require Odi?!) 
operations, based on theoretical results about stationary points of a specific rank-one nonnegative 
factorization problem (NF-ld) and the use of multiplicative updates. 
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To conclude, we point out that none of the algorithms presented in this paper is guaranteed to 
converge to a globally optimal solution (and, to the best of our knowledge, such an algorithm has not 
been proposed yet) ; this is in all likelihood due to the NP-hardness of the NMF and NF problems. 
Indeed, only convergence to a stationary point has been proved for the algorithms of Sections |2] and [3j 
a property which, while desirable, provides no guarantee about the quality of the solution obtained 
(for example, nothing prevents these methods from converging to a stationary but rank-deficient so- 
lution, which in most cases could be further improved). Finally, no convergence proof for the biclique 
finding algorithm introduced in Section |5] is provided (convergence results from the preceding sections 
no longer hold because of the dynamic updates of parameter d) ; however, this heuristic seems to give 
very satisfactory results in practice. 

Acknowledgment. We thank Pr. Paul Van Dooren and Pr. Laurence Wolsey for helpful discussions 
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